Abstract

The sampling problem in the training corpus is one of
the major sources of errors in corpus-based
applications. This paper proposes a corrective
training algorithm to best-fit the run-time context
domain in the application of bag generation. It
shows which objects should be adjusted and how to
adjust their probabilities. The resulting techniques
are greatly simplified, and the experimental results
demonstrate the promising effects of the training
algorithm in adapting from a generic domain to a
specific domain. In general, these techniques can be
easily extended to various language models and
corpus-based applications.

Keywords: Adaptive Learning, Bag Generation, 
Corpus, Corrective Training, Language 
Modeling. 

1. Introduction 

In corpus-based applications, most errors come from
two major sources. One is the power of the language
model, and the other is the sampling problem in the
training corpus. One possible way to avoid the
former type of error is to enhance the weaker
language models. The latter type of error results
from the small corpus size and the varying run-time
context domain. A small corpus produces zero and
unreliable probabilities in the training tables.
Some smoothing techniques (Jelinek and Mercer, 1980;
Jelinek, 1985; Katz, 1987) have been proposed to
deal with this problem. They provide static
adjustments of unreliable probabilities.
Nevertheless, these methods cannot handle the
run-time status of the context domain. Dynamic
models such as the cache-based model (Kuhn and De
Mori, 1990) and the multiple language model
(Matsunaga, et al., 1992) touch on run-time
behavior. The cache-based model reflects short-term
patterns of words, so it is effective for repeated
expressions. However, this approach is still very
intuitive and simple because it only adjusts the
word frequencies at run time, and does not revise
the statistical information in the long-term memory.
The multiple language model is based on several
corpora of different fields. Basically, a small
amount of similar text is imported and interpolated
with the original texts when the context domain is
presented. The extra cost of this approach is the
context determination. The similarity measures
between test sentences and the pre-defined context
domains may introduce additional errors.

In this paper, we do not address the power of
language models. We focus on the sampling problem
in the training corpus. A corrective training
algorithm, which can also be regarded as a dynamic
adaptive learning algorithm, is proposed for bag
generation. It exploits run-time feedback
information to best-fit the run-time environment.
That is, when an error occurs, the erroneous result
is corrected by users. Through the modification,
the system learns and adapts. It learns the
differences between the correct result and the
erroneous result. These form the useful run-time
feedback information. In other words, the system
learns from the mistakes it makes. Along these
lines, we first propose a language model to deal
with the sentence generation, i.e., bag generation,
problem, and a generic corpus is used to extract the
corresponding statistical information. Then the
training algorithm tries to adapt the generic
language model into a specific one according to the
run-time feedback information. At the same time,
the probabilities of the related entries in the
training table are adjusted. In the following
sections we first introduce the bag generation
algorithm, then describe the adaptive learning
model for bag generation. Before concluding, we



A Corrective Training Algorithm for Adaptive Learning in Bag Generation 



demonstrate the experimental results of this 
corrective training algorithm. 

2. Bag Generation Algorithm 

Bag generation (Brown, et al., 1990; Chen and Lee,
1993a) is a natural language generation method. It
can be applied to develop a generator in a
statistically based machine-translation system
(Brown, et al., 1990; Chen and Lee, 1993b). In bag
generation we take a sentence, divide it into words,
place the words in a bag, and then try to recover the
sentence given the bag. That is, given a bag of n
words, it tries to find a permutation p such that the
word sequence $\langle *, w_{p(1)}, w_{p(2)}, \ldots, w_{p(n)}, * \rangle$
denotes the correct sentence. The symbol * marks
the beginning ($w_{p(0)}$) and ending ($w_{p(n+1)}$) of a
sentence. In a Markov word m-gram model, the
permutation p is defined by the following formula.

$p = \arg\max_p P(w_{p(0)}) * P(w_{p(1)} \mid w_{p(0)}) * \ldots *$
Intuitively, a bag generation algorithm can first
generate all permutations of words, and then select
the permutation with the greatest probability.
However, the computational time cannot be
endured¹ when the number of words in the bag is
large. Here, a dynamic programming technique is
adopted. For any two sequences,

Sequence 1: $*, w_{p'(1)}, w_{p'(2)}, \ldots, w_{p'(i)}$, $1 \le i \le n$

Sequence 2: $*, w_{p''(1)}, w_{p''(2)}, \ldots, w_{p''(j)}$, $1 \le j \le n$

the merge operation can be applied to these two
sequences under the following four conditions if a
Markov word m-gram model is used.

(1) The sequence length should be longer than
m-1, i.e., $i > m-1$ and $j > m-1$.

(2) The lengths of these two sequences should be
equal, i.e., $i = j$.

(3) The last m-1 words in these two sequences
should be equal, i.e.,
$w_{p'(k)} = w_{p''(k)}$ for $i-(m-1)+1 \le k \le i$.

(4) These two sequences should cover the same
words, i.e., $\{w_{p'(x)}\} = \{w_{p''(y)}\}$ for $1 \le x, y \le i$.

The merge operation retains the sequence with
greater probability, and discards the sequence with
smaller probability. The following proposition
clarifies this point for a Markov word m-gram
model.

Proposition. The merge operation can be applied
to any two sequences under the following four
conditions, if a Markov word m-gram model is
adopted.

(1) The sequence length should be longer than
m-1.

(2) The lengths of these two sequences should be
equal.

(3) The last m-1 words in these two sequences
should be equal.

(4) These two sequences should cover the same
words.

Proof:

The first three are the basic conditions of a
Markov m-gram model. In this model, the system
uses the last m-1 words to predict the probability of
the current word. Let the probabilities of two
sequences $H_1$ and $H_2$ be $P(H_1)$ and $P(H_2)$, and
$P(H_1) > P(H_2)$. When the next word $w_n$ ($n > m-1$) is
read, their probabilities become
$P(H_1) * P(w_n \mid w_{1(n-m+1)}, \ldots, w_{1(n-1)})$ and
$P(H_2) * P(w_n \mid w_{2(n-m+1)}, \ldots, w_{2(n-1)})$, respectively.
If the last m-1 words are the same, i.e.,
$w_{1(n-m+1)} = w_{2(n-m+1)}, \ldots, w_{1(n-1)} = w_{2(n-1)}$, then the
former is still larger than the latter. However, if the
last m-1 words of these two sequences are not the
same, then the former may be smaller than the latter.
Thus, merging may preserve the sequence with
smaller probability and may introduce erroneous
results.

In fact, the first three conditions are enough
for other Markov-based applications such as
phone-to-text transcription. However, a problem
arises in bag generation if the last condition is not
obeyed. Consider a general case. Let the two
sequences $H_1$ and $H_2$ have the following forms.

$H_1$: $w_{10}, w_{11}, \ldots, w_{1(n-m)}, w_{(n-m+1)}, \ldots, w_{(n-1)}$

$H_2$: $w_{20}, w_{21}, \ldots, w_{2(n-m)}, w_{(n-m+1)}, \ldots, w_{(n-1)}$

If $\{w_{10}, w_{11}, \ldots, w_{1(n-m)}\}$ is not equal to
$\{w_{20}, w_{21}, \ldots, w_{2(n-m)}\}$, there must exist some $w_{1i}$ and
$w_{2j}$ such that $w_{1i} \ne w_{2j}$. If $P(H_1) > P(H_2)$, then the
word sequence involving $w_{1i}$, i.e.,

$w_{20}, w_{21}, \ldots, w_{2(n-m)}, w_{(n-m+1)}, \ldots, w_{(n-1)}, w_{1i}$

¹ Its time complexity is O(n!).
Table 1. Statistics of the Outside Test Data

category   total sentences   total words   extracted sentences   extracted words
A          192               1151          108                   457
G          216               1166          156                   685
J          323               1712          229                   981
N          307               1384          252                   959


Table 2. Experimental Results of Outside Test

                       approach 1 (optimal solution)     approach 2 (near optimal solution)
category               sentence        word              sentence        word
                       correct rate    correct rate      correct rate    correct rate
A                      40.74%          49.45%            34.26%          38.07%
G                      29.49%          44.23%            22.44%          32.26%
J                      34.50%          43.53%            30.13%          37.21%
N                      42.86%          49.11%            34.13%          38.48%
average correct rate   37.18%          46.30%            30.47%          36.63%


becomes neglected. This sequence may have a higher
probability, so an error may occur. ■
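
The merge conditions can be sketched for the bigram case (m = 2) as follows. This is only an illustrative sketch under stated assumptions: the function names, the dictionary `bigram` mapping adjacent word pairs to scores, and the flag `use_condition_4` are hypothetical and not taken from the paper. Partial sequences that end in the same word are merged; with `use_condition_4=True` they must also cover the same words, otherwise only the first three conditions are checked.

```python
from itertools import permutations


def score(seq, bigram):
    """Product of adjacent-pair scores for '*' + seq + '*'."""
    padded = ["*"] + list(seq) + ["*"]
    prob = 1.0
    for left, right in zip(padded, padded[1:]):
        prob *= bigram.get((left, right), 0.0)  # unseen pairs score 0 (assumption)
    return prob


def brute_force(bag, bigram):
    """Try every ordering of the bag; this takes O(n!) time."""
    return max(set(permutations(bag)), key=lambda s: score(s, bigram))


def dp_generate(bag, bigram, use_condition_4=True):
    """Extend partial sequences one word at a time and merge equivalent ones."""
    # merge key -> (probability, sequence so far, words still in the bag)
    beams = {(): (1.0, ("*",), tuple(sorted(bag)))}
    for _ in range(len(bag)):
        merged = {}
        for prob, seq, remaining in beams.values():
            for word in set(remaining):
                rest = list(remaining)
                rest.remove(word)
                new_prob = prob * bigram.get((seq[-1], word), 0.0)
                # All candidates in one round have equal length (condition 2);
                # the key holds the last word (condition 3) and, optionally,
                # the remaining words, i.e. the covered word set (condition 4).
                key = (word, tuple(rest)) if use_condition_4 else (word,)
                if key not in merged or new_prob > merged[key][0]:
                    merged[key] = (new_prob, seq + (word,), tuple(rest))
        beams = merged
    # Close every surviving sequence with the ending symbol '*'.
    best = max(beams.values(),
               key=lambda b: b[0] * bigram.get((b[1][-1], "*"), 0.0))
    return best[1][1:]
```

Dropping the word-set part of the key keeps far fewer partial sequences per round, which is why the three-condition variant is cheaper but may discard the sequence that would have led to the best complete sentence.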

A newspaper corpus that includes 350775
sentences (2461178 words) is adopted as the source
of the training data. It contains texts of several
categories. Therefore, it can be regarded as a
general corpus. The symbol * is added to the
beginning and ending of all sentences. In this
paper, a Markov word bigram model is considered
in bag generation to generate Chinese sentences.
With the Markov word bigram model, 2811953
total pairs and 905470 distinct pairs are extracted
from this corpus. For the outside test, four documents
are selected from the NTU Corpus, which is a Chinese
balanced corpus. They belong to categories A
(reportage), G (belles lettres), J (learned) and N
(adventure). The statistics of these documents are
shown in Table 1. A subset of sentences with
1-6 words is selected from these documents.
Columns 4 and 5 in Table 1 give the statistics of
these data. Table 2 shows the experimental
results of testing the extracted sentences, i.e., those
of length 1-6.

Approach 1 uses all four conditions in
Algorithm 1, but approach 2 uses only the first
three conditions. Two criteria, i.e., sentence correct
rate and word correct rate, are applied to evaluate
these two approaches. The former denotes how
many sentences are reproduced correctly, and the
latter denotes how many words occupy the correct
positions. Approach 1 performs better than
approach 2 on both criteria. However,
approach 2 is more efficient than approach 1. The
average performance of these two approaches is not
good enough because of the small training corpus.
The document of category A is selected from
newspapers, so the performance on this document is
better than that on categories G and J.
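
The two evaluation criteria can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and pooling the word matches over all sentences is an assumption, since the text does not say how the word correct rate is averaged.

```python
def sentence_correct_rate(desired, computed):
    """Fraction of sentences whose computed word order equals the desired one."""
    hits = sum(1 for d, c in zip(desired, computed) if list(d) == list(c))
    return hits / len(desired)


def word_correct_rate(desired, computed):
    """Fraction of words that occupy the correct position (pooled over sentences)."""
    correct = 0
    total = 0
    for d, c in zip(desired, computed):
        total += len(d)
        correct += sum(1 for dw, cw in zip(d, c) if dw == cw)
    return correct / total
```

A sentence that is only partly misordered still contributes its correctly placed words to the word-level rate, so that rate is never lower than the sentence-level rate when computed this way.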

3. The Corrective Training Algorithm

In corrective training, two major issues should be
considered: (1) Which object should be adjusted?
and (2) How much probability should be reassigned
to the object? The answers depend on the language
model and the application. This paper focuses on bag
generation with a Markov word bigram model. The
permutation p is defined by formula (1). Formula (2)
is derived further from formula (1).



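$p = \arg\max_p P(w_{p(0)}) * \prod_{i=0}^{n} P(w_{p(i+1)} \mid w_{p(i)})$    (1)

$= \arg\max_p \frac{P(*, w_{p(1)}) * P(w_{p(1)}, w_{p(2)}) * \ldots * P(w_{p(n)}, *)}{P(w_{p(1)}) * P(w_{p(2)}) * \ldots * P(w_{p(n)})}$

$= \arg\max_p P(*, w_{p(1)}) * P(w_{p(1)}, w_{p(2)}) * \ldots * P(w_{p(n)}, *)$    (2)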

The denominators of all the permutations are equal,
so they can be neglected, and only the
probabilities of adjacent words are used instead of
the original conditional probabilities. Consider two
word strings $D = \langle *, w_1, w_2, \ldots, w_n, * \rangle$ and
$C = \langle *, w_{p(1)}, w_{p(2)}, \ldots, w_{p(n)}, * \rangle$. They correspond to
the desired result and the final computed result,
respectively. If C is the same as D, then no
adjustment is required. Otherwise, Algorithm 1
finds the word pairs that may have to be adjusted.
Algorithm 1. FindPair($\langle *, w_1, w_2, \ldots, w_n, * \rangle$, $\langle *, w_{p(1)}, w_{p(2)}, \ldots, w_{p(n)}, * \rangle$)

for $i = 0$ to $n$ do mark($w_{p(i)}$) = false

for $i = 0$ to $n$ do
begin
    found = false
    $j = 0$
    while (($j \le n$) and (not found)) do
    begin
        if (($w_i = w_{p(j)}$) and (not mark($w_{p(j)}$))) then
        begin
            found = true
            mark($w_{p(j)}$) = true
            if ($w_{i+1} \ne w_{p(j+1)}$) then Adjust($\langle w_i, w_{i+1} \rangle$, $\langle w_{p(j)}, w_{p(j+1)} \rangle$)
        end
        else $j = j + 1$
    end
end
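
The pair-finding procedure can be sketched in Python as below. This is an illustrative reading, not the authors' code: the name `find_pairs` is hypothetical, the sentence boundaries are padded with `*` inside the function, and the suspicious tuples are returned as a list instead of being passed straight to the adjustment step.

```python
def find_pairs(desired, computed):
    """Return suspicious tuples as (ordered_pair, disordered_pair)."""
    d = ["*"] + list(desired) + ["*"]
    c = ["*"] + list(computed) + ["*"]
    n = len(desired)
    marked = [False] * (n + 1)  # one flag per position 0..n of the computed string
    tuples = []
    for i in range(n + 1):
        for j in range(n + 1):
            # Match word i of the desired string with its first unmarked
            # occurrence in the computed string.
            if d[i] == c[j] and not marked[j]:
                marked[j] = True
                # Different successors: the desired pair was broken up.
                if d[i + 1] != c[j + 1]:
                    tuples.append(((d[i], d[i + 1]), (c[j], c[j + 1])))
                break
    return tuples
```

For a desired order w1 w2 w3 w4 w5 and a computed order w3 w4 w5 w1 w2, this sketch yields three tuples: the pairs starting at `*`, at w2 and at w5.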



Algorithm 2. Adjust(OrderedPair, DisorderedPair)

$\Delta P = P_{OrderedPair} - P_{DisorderedPair}$
if ($\Delta P = 0$) then
begin
    $P^{new}_{OrderedPair} = \alpha * (P_{OrderedPair} + FloorValue)$
    $P^{new}_{DisorderedPair} = \alpha * (P_{DisorderedPair} - FloorValue)$
    $\Delta P_{OrderedPair} = P^{new}_{OrderedPair} - P_{OrderedPair}$
    $\Delta P_{DisorderedPair} = P^{new}_{DisorderedPair} - P_{DisorderedPair}$
end
else if ($\Delta P < 0$) then
begin
    $P^{new}_{OrderedPair} = \alpha * (P_{OrderedPair} - \beta_1 * \Delta P)$
    $P^{new}_{DisorderedPair} = \alpha * (P_{DisorderedPair} + \beta_2 * \Delta P)$
    $\Delta P_{OrderedPair} = P^{new}_{OrderedPair} - P_{OrderedPair}$
    $\Delta P_{DisorderedPair} = P^{new}_{DisorderedPair} - P_{DisorderedPair}$
end
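
The adjustment step can be sketched as a small function. The names `adjust`, `alpha`, `beta1` and `beta2` (scaling factor and the two learning rates) are chosen here for illustration, and the default `floor_value` is a hypothetical placeholder because the paper does not state the value it used. Returning zero adjustments when the ordered pair is already more probable is also an assumption, since that case has no branch in the algorithm.

```python
def adjust(p_ordered, p_disordered, alpha=1.0, beta1=0.6, beta2=0.5,
           floor_value=1e-5):
    """Return (adjustment for ordered pair, adjustment for disordered pair)."""
    delta_p = p_ordered - p_disordered
    if delta_p == 0:
        # Equal probabilities: separate the two pairs by the floor value.
        new_ordered = alpha * (p_ordered + floor_value)
        new_disordered = alpha * (p_disordered - floor_value)
    elif delta_p < 0:
        # Ordered pair is less probable: move both pairs toward each other
        # and past each other, in proportion to the learning rates.
        new_ordered = alpha * (p_ordered - beta1 * delta_p)
        new_disordered = alpha * (p_disordered + beta2 * delta_p)
    else:
        # Ordered pair is already more probable: nothing to correct.
        return 0.0, 0.0
    return new_ordered - p_ordered, new_disordered - p_disordered
```

With probabilities 0.1 and 0.6 and learning rates 0.6 and 0.5, the sketch moves the ordered pair up by 0.3 and the disordered pair down by 0.25.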



Assume $D = \langle *, w_1, w_2, w_3, w_4, w_5, * \rangle$ and
$C = \langle *, w_3, w_4, w_5, w_1, w_2, * \rangle$. Three suspicious
tuples, i.e., $(\langle *, w_1 \rangle, \langle *, w_3 \rangle)$, $(\langle w_2, w_3 \rangle, \langle w_2, * \rangle)$
and $(\langle w_5, * \rangle, \langle w_5, w_1 \rangle)$, are identified by
Algorithm 1. Because the same word may be used
more than once in the bag, the mark flags
guarantee that the same pair cannot appear in more
than one tuple. The first pair in each suspicious
tuple is called the ordered pair, and the second pair
is called the disordered pair. By formula (2), if the
probability of the ordered pair in each suspicious
tuple is larger than that of the disordered pair, then
the computed result would be the desired result.

Algorithm 2 adjusts those pairs whose
probabilities do not satisfy the above condition.
Two adjustments, i.e., $\Delta P_{OrderedPair}$ and
$\Delta P_{DisorderedPair}$, are computed to add some
probability to the ordered pairs and subtract some
probability from the disordered pairs. In this way,
the desired result will have a higher probability than
the erroneous computed result.
Algorithm 3. CorrectiveTraining($S_1, S_2, S_3, \ldots, S_m$)

while not (one of the stopping criteria is met) do
begin
    for $i = 1$ to $m$ do
    begin
        Let $S_i$ be $\langle *, w_1, w_2, \ldots, w_n, * \rangle$
        $p$ = formula1($\langle *, w_1, w_2, \ldots, w_n, * \rangle$)
        if $\langle *, w_{p(1)}, w_{p(2)}, \ldots, w_{p(n)}, * \rangle \ne \langle *, w_1, w_2, \ldots, w_n, * \rangle$
        then FindPair($\langle *, w_1, w_2, \ldots, w_n, * \rangle$, $\langle *, w_{p(1)}, w_{p(2)}, \ldots, w_{p(n)}, * \rangle$)
    end
    for each (ordered or disordered) pair do
    begin
        Compute the average adjustment $\Delta P_{pair}$ of this pair
        if $P_{pair} + \Delta P_{pair} < 0$ then $P_{pair} = 0.00001$ else $P_{pair} = P_{pair} + \Delta P_{pair}$
    end
end
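
The whole training loop can be sketched end to end as below. Several parts are assumptions made for illustration and are not taken from the paper: generation is done by brute force over all orderings (practical only for very short bags), unseen pairs score 0, the gradient of error is taken as the sum of absolute adjustments in one round, and the names `generate`, `find_pairs`, `adjust`, `ge_threshold` and `max_rounds` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def generate(bag, table):
    """Brute-force stand-in for formula (2): best-scoring ordering of the bag."""
    def score(seq):
        padded = ("*",) + seq + ("*",)
        prob = 1.0
        for pair in zip(padded, padded[1:]):
            prob *= table.get(pair, 0.0)
        return prob
    # sorted() only makes tie-breaking deterministic.
    return list(max(sorted(set(permutations(bag))), key=score))


def find_pairs(desired, computed):
    """Suspicious tuples (ordered pair, disordered pair), as in Algorithm 1."""
    d = ["*"] + list(desired) + ["*"]
    c = ["*"] + list(computed) + ["*"]
    marked = [False] * (len(c) - 1)
    found = []
    for i in range(len(d) - 1):
        for j in range(len(c) - 1):
            if d[i] == c[j] and not marked[j]:
                marked[j] = True
                if d[i + 1] != c[j + 1]:
                    found.append(((d[i], d[i + 1]), (c[j], c[j + 1])))
                break
    return found


def adjust(p_o, p_d, alpha, beta1, beta2, floor_value):
    """Adjustments for one tuple, or None if the ordered pair already wins."""
    delta = p_o - p_d
    if delta > 0:
        return None
    if delta == 0:
        new_o, new_d = alpha * (p_o + floor_value), alpha * (p_d - floor_value)
    else:
        new_o, new_d = alpha * (p_o - beta1 * delta), alpha * (p_d + beta2 * delta)
    return new_o - p_o, new_d - p_d


def corrective_training(sentences, table, alpha=1.0, beta1=0.6, beta2=0.5,
                        floor_value=1e-5, ge_threshold=0.0, max_rounds=50):
    """Repeat generate / find pairs / adjust until no feedback is left."""
    for _ in range(max_rounds):
        pending = defaultdict(list)  # pair -> adjustments collected this round
        for desired in sentences:
            computed = generate(desired, table)
            if computed == list(desired):
                continue
            for ordered, disordered in find_pairs(desired, computed):
                result = adjust(table.get(ordered, 0.0), table.get(disordered, 0.0),
                                alpha, beta1, beta2, floor_value)
                if result is not None:
                    pending[ordered].append(result[0])
                    pending[disordered].append(result[1])
        ge = sum(abs(a) for adjustments in pending.values() for a in adjustments)
        if ge <= ge_threshold:
            break  # all sentences correct, or nothing left to adjust
        for pair, adjustments in pending.items():
            new_p = table.get(pair, 0.0) + sum(adjustments) / len(adjustments)
            if pair in table:
                table[pair] = 0.00001 if new_p < 0 else new_p
            elif new_p > 0:
                table[pair] = new_p  # new pairs need a positive probability
    return table
```

Collecting all adjustments of a round before touching the table mirrors the averaging step: a pair that is pushed in different directions by different sentences receives the mean of those pushes.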



In Algorithm 2, $\alpha$ is the scaling factor. It is
often set to 1. $\beta_1$ and $\beta_2$ ($0 \le \beta_1, \beta_2 \le 1$) are the
learning rates of the ordered pairs and the
disordered pairs, respectively. The sum of $\beta_1$ and
$\beta_2$ must be greater than 1. In general, $\beta_1$ and $\beta_2$
are used to control the distance between the ordered
pair and the disordered pair after adjustment. The
distance $\delta$ is equal to $(\beta_1 + \beta_2 - 1) * abs(\Delta P)$. The
function $abs(\Delta P)$ computes the absolute value of $\Delta P$.
Assume an ordered pair OP and a disordered pair
DP have probabilities 0.1 and 0.6, respectively. $\beta_1$
and $\beta_2$ are set to 0.6 and 0.5, respectively. After
adjustment, the probabilities of OP and DP become
0.4 and 0.35, respectively. Their difference $\delta$ is
0.05. $\delta$ is an important factor for a robust language
model, and it highly depends on $\beta_1$ and $\beta_2$.
Obviously, if $\beta_1$ and $\beta_2$ are both set to 0, then no
feedback information is used in this algorithm².
On the contrary, if $\beta_1$ and $\beta_2$ are both set to 1, then
the probabilities of the ordered pair and the
disordered pair are exchanged. $\beta_1 > \beta_2$
($\beta_1 < \beta_2$) means Algorithm 2 emphasizes the
positive (negative) feedback information.
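
The expression for the distance after adjustment follows from the two update rules of Algorithm 2. The short derivation below assumes a scaling factor of 1 and uses notation introduced here for illustration: $\beta_1$ and $\beta_2$ for the two learning rates, $P_O$ and $P_D$ for the probabilities of the ordered and the disordered pair, primes for the adjusted values, and $\Delta P = P_O - P_D < 0$. The last line checks the numerical example from the text.

```latex
\begin{align*}
P'_{O} - P'_{D} &= (P_{O} - \beta_1 \Delta P) - (P_{D} + \beta_2 \Delta P) \\
                &= \Delta P - (\beta_1 + \beta_2)\,\Delta P \\
                &= (\beta_1 + \beta_2 - 1)\,\lvert \Delta P \rvert = \delta
                   \qquad (\Delta P < 0) \\
\delta &= (0.6 + 0.5 - 1) \times \lvert 0.1 - 0.6 \rvert = 0.05 = 0.40 - 0.35
\end{align*}
```

The sign of $\beta_1 + \beta_2 - 1$ therefore decides whether the ordered pair ends up above the disordered pair, which is consistent with the requirement that the two learning rates sum to more than 1.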

Algorithm 3 shows a complete corrective
training algorithm for bag generation. $m$ sentences,
$S_1, S_2, S_3, \ldots, S_m$, are used for corrective training.
This algorithm checks whether the result computed
by formula (1) is correct. If it is not,
this algorithm adjusts the probabilities of the
ordered pairs and the disordered pairs. Because a
pair may be adjusted more than once, the
average of all its adjustments is computed to avoid
the over-tuning problem. For example, if there are two
adjustments $A_1$ and $A_2$ for the same pair, then the
final adjustment for this pair will be $(A_1 + A_2)/2$.

² The floor value is ignored in current discussions.

Moreover, if the sum of the original probability and
the adjustment is less than zero, then the
probability of this pair is set to a very small
value, i.e., 0.00001. Besides, a new pair with
negative or zero probability is not allowed to
be added to the training table. These average
adjustments are fed into the old training table, and a
new training table is formed. This algorithm is
repeated until one of the stopping criteria is met. In
practice, several criteria can be considered.
Firstly, it is based on the magnitude of the gradient of
error (GE), shown as follows:

$GE = \sum_{i=1}^{m} y_i$

where $y_i = \sum_{j=1}^{k} [\, abs(\text{adjustment in ordered pair}_j) + abs(\text{adjustment in disordered pair}_j) \,]$
if $S_i$ has $k$ suspicious tuples. Otherwise, $y_i = 0$.

GE specifies whether the learning direction is
correct or not. Clearly, if no adjustment is
performed, then GE is equal to zero. Therefore,
Algorithm 3 is terminated when the magnitude of the
gradient of error is sufficiently small. Secondly, the
algorithm stops as soon as all the tests are correct,

Table 3. Experimental Results by Using Approach 1 and the Corrective Training Algorithm

         performance of part 1   performance of part 2   performance of part 3   average correct rate
stage    sentence    word        sentence    word        sentence    word        sentence    word
         level       level       level       level       level       level       level       level
0        40.26%      46.88%      39.47%      45.93%      23.68%      37.98%      34.50%      43.53%
1        100.0%      100.0%      44.74%      49.51%      30.26%      41.54%      58.52%      64.12%
2        93.51%      94.36%      100.0%      100.0%      32.89%      43.92%      75.55%      78.80%
3        90.91%      91.39%      98.68%      98.37%      100.0%      100.0%      96.51%      96.53%


Table 4. Experimental Results by Using Approach 2 and the Corrective Training Algorithm

         performance of part 1   performance of part 2   performance of part 3   average correct rate
stage    sentence    word        sentence    word        sentence    word        sentence    word
         level       level       level       level       level       level       level       level
0        26.85%      31.39%      25.93%      30.88%      15.89%      25.00%      22.91%      29.03%
1        72.22%      74.25%      27.78%      33.03%      21.50%      30.27%      40.56%      45.74%
2        64.81%      67.20%      62.04%      61.76%      20.56%      31.80%      49.23%      53.27%
3        65.74%      66.31%      60.19%      58.53%      59.81%      60.37%      61.92%      61.74%



i.e., no suspicious tuples are generated. Thirdly, the
algorithm stops when no feedback information is
obtained, i.e., no pair is adjusted.
However, this does not mean that the performance
reaches 100%, because some errors are caused by
the power of the language model.

4. Experimental Results 

In order to demonstrate the effect of the corrective
training algorithm in different context domains, the
document of category J, i.e., a technical paper, is
selected for the experiment. First, the extracted
sentences of length 1-6 are partitioned into three
parts (76, 76, 77). At stage 1, part one is used for
corrective training, and parts two and three are
used to test the performance. At stage 2, part two is
used for corrective training, and parts one and three
are used to test the performance. Finally, at stage 3,
we apply the corrective training to part three, and
use the other two parts to test the performance. Table 3
shows the results of using approach 1 and the
corrective training algorithm.

On the one hand, the above experiment shows that
this algorithm generalizes well. When we
continue the corrective training on the subsequent
part(s), the performance on the preceding part(s)
remains very high. On the other hand, when the
number of test sentences from the specific context
domain increases, the performance of the subsequent
test is improved significantly.



Next, approach 2 and the corrective training
algorithm are applied to the complete document J.
The three parts have 108 (567), 108 (557) and 107
(588) sentences (words), respectively. The results
are shown in Table 4. Approach 2 is a near optimal
bag generation, so incomplete feedback
information decreases the power of the adaptive
learning model. Fortunately, bag generation can be
coupled with other modules in practical applications,
e.g., a parser in machine translation systems. A parser
can partition a bag into several smaller bags (Chen
and Lee, 1993b). In this way, efficiency is
not a problem, and approach 1 (optimal bag
generation) can be adopted.

5. Concluding Remarks 

This paper proposes a corrective training
algorithm for task adaptation to best-fit the run-time
environment in the application of bag generation. It
controls the distance between the ordered pairs and the
disordered pairs in the suspicious tuples. The
resulting techniques are greatly simplified and
robust. They give improved performance.

Although this adaptive learning algorithm is a
greedy algorithm, i.e., a linear gradient search
algorithm, that seeks a locally optimal result,
it still has a good chance of reaching the globally
optimal result because it starts with a good
initial state, i.e., the initial training table. Besides, this
corrective training algorithm is also suitable for
incremental training. Initially, the training table can be
generated from a generic corpus. If the test



Chen and Lee 



sentences come from other specific domains, this 
algorithm automatically revises the old training 
table and produces a specific training table. In 
general, these techniques can be easily extended to 
various language models and corpus-based 
applications. 

In this paper, we assign each parameter used in the
corrective training algorithm, i.e., $\alpha$, $\beta_1$, $\beta_2$
and the floor value, a constant value. However, is
the procedure stable for all values of these
parameters, or are there universal values of these
parameters? The parameter-setting problem is
important and needs further investigation.
Moreover, a more robust corrective training
algorithm, e.g., a non-linear corrective training
algorithm, is also needed in the future.